Blue Nile Mock Project¶

Business Case¶

Case is to

  • To segment the customers as Gifters & Self-Gifters on the client's existing application and
  • To segment their customer base into different homogenous groups

This behavioral study was to be transferred to marketing team that will help them target the group of customers and personalize their user experience according to the study.

The Solution¶

Solution is to base on Customer buying pattern for the Gifter & Self-Gifter Segmentation.
To segment their customer base into different homogeneous groups, RFM(Recency, Frequency, Monetary) Analysis is used.

  • These techniques helped in identifying key differentiators that divided customers into groups that can be targeted.
  • Behavioral (spending, consumption, usage) tendencies were considered when determining customer segmentation.
  • RFM helped in grouping the customers based on their transaction history which included how recently, how often, how much did they buy and many more.
In [1]:
# Importing required libraries for analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Reading the data we have
sales_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q1.csv',low_memory = False)    #Sales Data
email_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q2.csv')                       #Email Address Data
customer_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q3.csv')                    #Customer Data
product_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q4.csv')                     #Product Data
address_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q5.csv')                     #Address Data
#setting low_memory = False to avoid dtypewarning for some rows having mixed datatypes
In [3]:
address_df.shape
Out[3]:
(1048575, 5)
In [4]:
# Checking if email address key is null for any order as it represents a customer
sales_df['email_address_key'].isnull().values.any()
Out[4]:
False

We don't have any null values in Email Address Key column which represents the customer.

In [5]:
# Checking the sales data shape to know the number of rows and columns it have
sales_df.shape
Out[5]:
(808460, 29)
In [6]:
# Getting info of how many customers we have in total
sales_df.email_address_key.nunique()
Out[6]:
454317
In [7]:
# Checking if there are any duplicates in sales data as it's our primary datasource
duplicates = sales_df[sales_df.duplicated()].sort_values('email_address_key',ascending=True)
duplicates
Out[7]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... local_currency_after_disc_sale_amount usd_after_disc_sale_amount usd_after_disc_cost_amount usd_exchange_credit_amount quantity DIAMOND_SKU_1 DIAMOND_SKU_2 DIAMOND_SKU_4 DIAMOND_SKU_5 OFFER_ID
332453 Other Jewelry Misc Misc 1227859 87307460 87307460 STANDARD BN 21-04-2021 21-04-2021 ... 27.0 27.00 0.0000 0.0 0 NaN NaN NaN NaN 10478.0
331935 Other Jewelry Misc Misc 1227859 48610624 48610624 STANDARD BN 03-04-2021 02-04-2021 ... 27.0 27.00 0.0000 0.0 0 NaN NaN NaN NaN 10478.0
657035 Other Jewelry Metals Silver 1264396 34050926 34050926 STANDARD BN 03-06-2021 03-06-2021 ... 87.5 87.50 35.3208 0.0 1 NaN NaN NaN NaN 44984.0
269593 Other Jewelry Metals Silver 1277299 94672258 94672258 STANDARD BN 10-12-2020 10-12-2020 ... 75.0 75.00 30.0294 0.0 1 NaN NaN NaN NaN 18138.0
285034 Other Jewelry Misc Misc 1277299 94672258 94672258 STANDARD BN 10-12-2020 10-12-2020 ... 6.0 6.00 0.0000 0.0 0 NaN NaN NaN NaN 10476.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
304940 Other Jewelry Misc Misc 31489201 19676740 19676740 STANDARD BN 02-08-2022 02-08-2022 ... 30.0 30.00 0.0000 0.0 0 NaN NaN NaN NaN 10478.0
315911 Other Jewelry Misc Misc 31489327 59724594 59724594 STANDARD BN 31-07-2022 02-08-2022 ... 30.0 30.00 0.0000 0.0 0 NaN NaN NaN NaN 10478.0
313273 Other Jewelry Misc Misc 31489437 69892179 69892179 STANDARD BNCA 02-08-2022 02-08-2022 ... 43.0 33.66 0.0000 0.0 0 NaN NaN NaN NaN 10478.0
314112 Other Jewelry Misc Misc 31490510 60739028 60739028 STANDARD BN 03-08-2022 03-08-2022 ... 30.0 30.00 0.0000 0.0 0 NaN NaN NaN NaN 10478.0
313222 Other Jewelry Misc Misc 31490614 87155101 87155101 STANDARD BN 03-08-2022 03-08-2022 ... 30.0 30.00 0.0000 0.0 0 NaN NaN NaN NaN 10478.0

7558 rows × 29 columns

We have 7,558 duplicated rows in the dataset.

In [8]:
# Removing duplicated rows from the sales data
non_dup_df = sales_df.drop_duplicates()
non_dup_df
Out[8]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... local_currency_after_disc_sale_amount usd_after_disc_sale_amount usd_after_disc_cost_amount usd_exchange_credit_amount quantity DIAMOND_SKU_1 DIAMOND_SKU_2 DIAMOND_SKU_4 DIAMOND_SKU_5 OFFER_ID
0 NaN NaN NaN 16983312 81947282 81947282 STANDARD BN 19-09-2020 19-09-2020 ... 5558.00000 5558.0 3726.6400 0.0 1 LD12713965 LD12142705 LD12681121 LD13880254 NaN
1 Engagement Engagement BYO Semi-Mounts 22822442 59639029 75113342 EXCHANGE BN 21-09-2021 21-09-2021 ... 2149.00000 2149.0 1219.2000 2149.0 1 LD14946369 NaN NaN NaN NaN
2 Engagement Engagement BYO Semi-Mounts 19696924 23989389 23989389 SPECIAL ORDER BN 06-06-2021 07-06-2021 ... 5621.00000 5621.0 3868.9700 0.0 1 LD16111745 NaN NaN NaN NaN
3 Engagement Engagement BYO Semi-Mounts 24120678 12758238 12758238 STANDARD BN 20-09-2021 20-09-2021 ... 2759.00000 2759.0 1776.2600 0.0 1 LD16655615 NaN NaN NaN NaN
4 Engagement Engagement BYO Semi-Mounts 23536382 84045443 84045443 SPECIAL ORDER BN 21-12-2020 21-12-2020 ... 2622.00000 2622.0 1305.7100 0.0 1 LD14659784 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
808455 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 23036258 22989040 22989040 STANDARD BN 15-09-2020 15-09-2020 ... 2380.00000 2380.0 1749.0308 0.0 1 LD14113460 LD14252074 NaN NaN NaN
808456 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 10708737 78485771 78485771 STANDARD BN 16-02-2021 16-02-2021 ... 2357.00001 2357.0 1766.8096 0.0 1 LD14409557 LD14407739 NaN NaN NaN
808457 Engagement Engagement BYO 3-Stone Metal 19190442 64237203 64237203 STANDARD BN 08-10-2020 08-10-2020 ... 6235.00001 6235.0 4448.4775 0.0 1 LD14098852 LD12868926 NaN NaN NaN
808458 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 18101292 51603645 51603645 STANDARD BN 22-08-2020 22-08-2020 ... 3544.99999 3545.0 2768.8646 0.0 1 LD14100562 LD13980487 NaN NaN NaN
808459 NaN NaN NaN 24634277 71136882 71136882 STANDARD BN 23-01-2022 08-02-2022 ... 5300.00001 5300.0 3258.2825 0.0 1 LD17179489 LD17153838 LD16411216 LD16158119 NaN

800902 rows × 29 columns

We now have about 8lakh distinct orders.

In [9]:
# Checking if there are same number of customers after removing duplicates
non_dup_df.email_address_key.nunique()
Out[9]:
454317

Customers before and after removing the duplicates should remain the same and those both came out to be same.

Data with Non Returns in orders¶

We need to remove the orders which are returned.

In [10]:
# Orders to keep only non-return orders
non_returns = non_dup_df[non_dup_df['return_date'].isnull()]
non_returns
Out[10]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... local_currency_after_disc_sale_amount usd_after_disc_sale_amount usd_after_disc_cost_amount usd_exchange_credit_amount quantity DIAMOND_SKU_1 DIAMOND_SKU_2 DIAMOND_SKU_4 DIAMOND_SKU_5 OFFER_ID
0 NaN NaN NaN 16983312 81947282 81947282 STANDARD BN 19-09-2020 19-09-2020 ... 5558.00000 5558.0 3726.6400 0.0 1 LD12713965 LD12142705 LD12681121 LD13880254 NaN
1 Engagement Engagement BYO Semi-Mounts 22822442 59639029 75113342 EXCHANGE BN 21-09-2021 21-09-2021 ... 2149.00000 2149.0 1219.2000 2149.0 1 LD14946369 NaN NaN NaN NaN
2 Engagement Engagement BYO Semi-Mounts 19696924 23989389 23989389 SPECIAL ORDER BN 06-06-2021 07-06-2021 ... 5621.00000 5621.0 3868.9700 0.0 1 LD16111745 NaN NaN NaN NaN
3 Engagement Engagement BYO Semi-Mounts 24120678 12758238 12758238 STANDARD BN 20-09-2021 20-09-2021 ... 2759.00000 2759.0 1776.2600 0.0 1 LD16655615 NaN NaN NaN NaN
4 Engagement Engagement BYO Semi-Mounts 23536382 84045443 84045443 SPECIAL ORDER BN 21-12-2020 21-12-2020 ... 2622.00000 2622.0 1305.7100 0.0 1 LD14659784 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
808455 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 23036258 22989040 22989040 STANDARD BN 15-09-2020 15-09-2020 ... 2380.00000 2380.0 1749.0308 0.0 1 LD14113460 LD14252074 NaN NaN NaN
808456 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 10708737 78485771 78485771 STANDARD BN 16-02-2021 16-02-2021 ... 2357.00001 2357.0 1766.8096 0.0 1 LD14409557 LD14407739 NaN NaN NaN
808457 Engagement Engagement BYO 3-Stone Metal 19190442 64237203 64237203 STANDARD BN 08-10-2020 08-10-2020 ... 6235.00001 6235.0 4448.4775 0.0 1 LD14098852 LD12868926 NaN NaN NaN
808458 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 18101292 51603645 51603645 STANDARD BN 22-08-2020 22-08-2020 ... 3544.99999 3545.0 2768.8646 0.0 1 LD14100562 LD13980487 NaN NaN NaN
808459 NaN NaN NaN 24634277 71136882 71136882 STANDARD BN 23-01-2022 08-02-2022 ... 5300.00001 5300.0 3258.2825 0.0 1 LD17179489 LD17153838 LD16411216 LD16158119 NaN

677956 rows × 29 columns

We now have 6,77,956 non-returned orders.

In [11]:
# Checking if the above data has only non-return orders
non_returns['return_date'].isnull().sum()
Out[11]:
677956

As number of non-returned orders is equal to null values sum in return date column, the above data consists only of non-return orders.

In [12]:
# Merging Sales all the Datasets we have with Non-Returned orders dataset
sales_email = pd.merge(non_returns, email_df, on = 'email_address_key', how = 'left')
se_cust = pd.merge(sales_email,customer_df.drop_duplicates(['email_address']), left_on = 'email_Address',
                   right_on = 'email_address', how = 'left') 
sec_prod = pd.merge(se_cust, product_df, left_on ='OFFER_ID', right_on = 'OFFER_KEY', how='left')
final_merged = pd.merge(sec_prod,address_df.drop_duplicates(['basket_id']), left_on=['basket_id'],right_on=['basket_id'], 
                        how='left')

final_merged
Out[12]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... derived_gender age OFFER_KEY_x name TARGET_GENDER display_type PRODUCT_KEY OFFER_KEY_y BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
0 NaN NaN NaN 16983312 81947282 81947282 STANDARD BN 19-09-2020 19-09-2020 ... M 55.0 NaN NaN NaN NaN 2551021.0 62363.0 10474539.0 10474540.0
1 Engagement Engagement BYO Semi-Mounts 22822442 59639029 75113342 EXCHANGE BN 21-09-2021 21-09-2021 ... NaN NaN NaN NaN NaN NaN 3533894.0 76208.0 11277556.0 11030807.0
2 Engagement Engagement BYO Semi-Mounts 19696924 23989389 23989389 SPECIAL ORDER BN 06-06-2021 07-06-2021 ... M 31.0 NaN NaN NaN NaN 3565789.0 76212.0 11129079.0 11129080.0
3 Engagement Engagement BYO Semi-Mounts 24120678 12758238 12758238 STANDARD BN 20-09-2021 20-09-2021 ... M 37.0 NaN NaN NaN NaN 3601472.0 76207.0 11329644.0 11178488.0
4 Engagement Engagement BYO Semi-Mounts 23536382 84045443 84045443 SPECIAL ORDER BN 21-12-2020 21-12-2020 ... NaN NaN NaN NaN NaN NaN 3468286.0 76897.0 10759761.0 10759762.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
677951 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 23036258 22989040 22989040 STANDARD BN 15-09-2020 15-09-2020 ... M NaN NaN NaN NaN NaN 2058088.0 51574.0 10466130.0 10466130.0
677952 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 10708737 78485771 78485771 STANDARD BN 16-02-2021 16-02-2021 ... M 39.0 NaN NaN NaN NaN 3031615.0 50614.0 10893648.0 10893649.0
677953 Engagement Engagement BYO 3-Stone Metal 19190442 64237203 64237203 STANDARD BN 08-10-2020 08-10-2020 ... M 60.0 NaN NaN NaN NaN 2098637.0 49876.0 10504692.0 10504692.0
677954 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 18101292 51603645 51603645 STANDARD BN 22-08-2020 22-08-2020 ... M NaN NaN NaN NaN NaN 2058088.0 51574.0 10422395.0 10412311.0
677955 NaN NaN NaN 24634277 71136882 71136882 STANDARD BN 23-01-2022 08-02-2022 ... F NaN NaN NaN NaN NaN 2551032.0 62363.0 11698964.0 11698965.0

677956 rows × 43 columns

Non-returned orders and Final Merged Orders have the same the number of rows and i.e., 6,77,956.

In [13]:
#checking if there are customers whose email address is not there in our dataset
pd.DataFrame(final_merged, columns=['email_address_key', 'email_Address']).nunique()
Out[13]:
email_address_key    417027
email_Address        402782
dtype: int64

As Email Address Key and Email Address values doesn't match, there were some customers whose email address is not there in the dataset.
We need to remove those customer orders as we can't make any marketing campaigns or others without knowing Email address of a customer.

In [14]:
# Dropping the rows that have null values in email address column
final_merged = final_merged[final_merged['email_Address'].notna()]
final_merged
Out[14]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... derived_gender age OFFER_KEY_x name TARGET_GENDER display_type PRODUCT_KEY OFFER_KEY_y BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
0 NaN NaN NaN 16983312 81947282 81947282 STANDARD BN 19-09-2020 19-09-2020 ... M 55.0 NaN NaN NaN NaN 2551021.0 62363.0 10474539.0 10474540.0
2 Engagement Engagement BYO Semi-Mounts 19696924 23989389 23989389 SPECIAL ORDER BN 06-06-2021 07-06-2021 ... M 31.0 NaN NaN NaN NaN 3565789.0 76212.0 11129079.0 11129080.0
3 Engagement Engagement BYO Semi-Mounts 24120678 12758238 12758238 STANDARD BN 20-09-2021 20-09-2021 ... M 37.0 NaN NaN NaN NaN 3601472.0 76207.0 11329644.0 11178488.0
5 Engagement Engagement BYO Semi-Mounts 24092029 18812838 18812838 STANDARD BN 11-07-2021 18-08-2021 ... M 34.0 NaN NaN NaN NaN 2604603.0 60452.0 11264941.0 11264941.0
6 Engagement Engagement BYO Semi-Mounts 24079756 47744213 47744213 STANDARD BN 09-06-2021 09-06-2021 ... M NaN NaN NaN NaN NaN 2888321.0 63046.0 11137487.0 11135306.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
677951 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 23036258 22989040 22989040 STANDARD BN 15-09-2020 15-09-2020 ... M NaN NaN NaN NaN NaN 2058088.0 51574.0 10466130.0 10466130.0
677952 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 10708737 78485771 78485771 STANDARD BN 16-02-2021 16-02-2021 ... M 39.0 NaN NaN NaN NaN 3031615.0 50614.0 10893648.0 10893649.0
677953 Engagement Engagement BYO 3-Stone Metal 19190442 64237203 64237203 STANDARD BN 08-10-2020 08-10-2020 ... M 60.0 NaN NaN NaN NaN 2098637.0 49876.0 10504692.0 10504692.0
677954 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 18101292 51603645 51603645 STANDARD BN 22-08-2020 22-08-2020 ... M NaN NaN NaN NaN NaN 2058088.0 51574.0 10422395.0 10412311.0
677955 NaN NaN NaN 24634277 71136882 71136882 STANDARD BN 23-01-2022 08-02-2022 ... F NaN NaN NaN NaN NaN 2551032.0 62363.0 11698964.0 11698965.0

655209 rows × 43 columns

After dropping those rows, we have 6,55,209 orders.

In [15]:
# Checking if all the rows without email address were removed are not
pd.DataFrame(final_merged, columns=['email_address_key', 'email_Address']).nunique()
Out[15]:
email_address_key    402782
email_Address        402782
dtype: int64

As the count of both columns is equal, all those rows were dropped and now we have 4,02,782 distinct customers.

In [16]:
# Pie chart representation of Distribution of Customers and Total Orders over different Categories

cust_dist = final_merged.groupby('merch_category_rollup').agg({'merch_category_rollup':'first', 'email_address_key':'nunique'})
order_dist = final_merged.groupby('merch_category_rollup').agg({'merch_category_rollup':'first', 'basket_id':'nunique'})

a, b = cust_dist.index, order_dist.index
c, d = cust_dist['email_address_key'], order_dist['basket_id']

fig = plt.figure(figsize=(18,10), dpi=1600)

# Customer Distribution Plot
ax1 = plt.subplot2grid((2,2),(0,0))
plt.pie(c, labels = a, radius = 1.5,textprops = {"fontsize":15}, autopct = "%1.1f%%", 
        wedgeprops = {"edgecolor":"black",'linewidth':0.5})
plt.title('Customers Distribution', y = 1.15)

# Order Distribution Plot
ax1 = plt.subplot2grid((2,2), (0,1))
plt.pie(d, labels = b, radius = 1.5,textprops = {"fontsize":15}, autopct = "%1.1f%%", 
        wedgeprops = {"edgecolor":"black",'linewidth':0.5})
plt.title('Orders Distribution', y = 1.15)

plt.show()
  • About 85% customers are buying products from Jewelry category and 15% are of Engagement Category.
  • 86% orders placed among total orders are of Jewelry category.
In [17]:
# Checking data types of columns in the data
final_merged.dtypes
Out[17]:
merch_category_rollup                      object
merch_sub_category_rollup                  object
MERCH_PRODUCT_CATEGORY                     object
email_address_key                           int64
basket_id                                   int64
original_basket_id                          int64
order_type                                 object
site                                       object
basket_start_date                          object
submit_date                                object
order_date                                 object
ship_date                                  object
ship_delivery_date                         object
return_date                                object
discount_name                              object
local_currency                             object
promo_discount_applied_flag                object
local_currency_before_disc_sale_amount    float64
discount_promo_amount                     float64
local_currency_after_disc_sale_amount     float64
usd_after_disc_sale_amount                float64
usd_after_disc_cost_amount                float64
usd_exchange_credit_amount                float64
quantity                                    int64
DIAMOND_SKU_1                              object
DIAMOND_SKU_2                              object
DIAMOND_SKU_4                              object
DIAMOND_SKU_5                              object
OFFER_ID                                  float64
email_Address                              object
email_address                              object
first_name                                 object
last_name                                  object
derived_gender                             object
age                                       float64
OFFER_KEY_x                               float64
name                                       object
TARGET_GENDER                              object
display_type                               object
PRODUCT_KEY                               float64
OFFER_KEY_y                               float64
BILL_TO_ADDRESS_KEY                       float64
SHIP_TO_ADDRESS_KEY                       float64
dtype: object
In [18]:
# Filling na values of shipping, billing address columns with -1 to preserve the int datatype and to distinguish null values
final_merged[['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY']] = final_merged[['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY']].fillna(-1)

# Changing needed columns dtypes 
final_merged = final_merged.astype({"SHIP_TO_ADDRESS_KEY" : int, "BILL_TO_ADDRESS_KEY" : int, 
                                    "email_address_key" : object, "basket_id" : object})
final_merged["submit_date"] = pd.to_datetime(final_merged["submit_date"], dayfirst = True)
final_merged["order_date"]  = pd.to_datetime(final_merged["order_date"], dayfirst = True)

# Checking for dtypes after change
pd.DataFrame(final_merged, columns=['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY','basket_id','submit_date','order_date']).dtypes
C:\Users\affine\AppData\Local\Temp\ipykernel_16568\2691718885.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_merged[['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY']] = final_merged[['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY']].fillna(-1)
Out[18]:
SHIP_TO_ADDRESS_KEY             int32
BILL_TO_ADDRESS_KEY             int32
basket_id                      object
submit_date            datetime64[ns]
order_date             datetime64[ns]
dtype: object

Sales Report¶

In [19]:
# Getting shape of our final jewelry dataset
final_merged.shape
Out[19]:
(655209, 43)
In [20]:
list(final_merged.columns)
Out[20]:
['merch_category_rollup',
 'merch_sub_category_rollup',
 'MERCH_PRODUCT_CATEGORY',
 'email_address_key',
 'basket_id',
 'original_basket_id',
 'order_type',
 'site',
 'basket_start_date',
 'submit_date',
 'order_date',
 'ship_date',
 'ship_delivery_date',
 'return_date',
 'discount_name',
 'local_currency',
 'promo_discount_applied_flag',
 'local_currency_before_disc_sale_amount',
 'discount_promo_amount',
 'local_currency_after_disc_sale_amount',
 'usd_after_disc_sale_amount',
 'usd_after_disc_cost_amount',
 'usd_exchange_credit_amount',
 'quantity',
 'DIAMOND_SKU_1',
 'DIAMOND_SKU_2',
 'DIAMOND_SKU_4',
 'DIAMOND_SKU_5',
 'OFFER_ID',
 'email_Address',
 'email_address',
 'first_name',
 'last_name',
 'derived_gender',
 'age',
 'OFFER_KEY_x',
 'name',
 'TARGET_GENDER',
 'display_type',
 'PRODUCT_KEY',
 'OFFER_KEY_y',
 'BILL_TO_ADDRESS_KEY',
 'SHIP_TO_ADDRESS_KEY']
In [21]:
# Finding how many customers we have
final_merged.email_address_key.nunique()
Out[21]:
402782

The total number of unique customers are 4,02,782.

In [22]:
# Getting data of final_merged
final_merged.head()
Out[22]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... derived_gender age OFFER_KEY_x name TARGET_GENDER display_type PRODUCT_KEY OFFER_KEY_y BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
0 NaN NaN NaN 16983312 81947282 81947282 STANDARD BN 19-09-2020 2020-09-19 ... M 55.0 NaN NaN NaN NaN 2551021.0 62363.0 10474539 10474540
2 Engagement Engagement BYO Semi-Mounts 19696924 23989389 23989389 SPECIAL ORDER BN 06-06-2021 2021-06-07 ... M 31.0 NaN NaN NaN NaN 3565789.0 76212.0 11129079 11129080
3 Engagement Engagement BYO Semi-Mounts 24120678 12758238 12758238 STANDARD BN 20-09-2021 2021-09-20 ... M 37.0 NaN NaN NaN NaN 3601472.0 76207.0 11329644 11178488
5 Engagement Engagement BYO Semi-Mounts 24092029 18812838 18812838 STANDARD BN 11-07-2021 2021-08-18 ... M 34.0 NaN NaN NaN NaN 2604603.0 60452.0 11264941 11264941
6 Engagement Engagement BYO Semi-Mounts 24079756 47744213 47744213 STANDARD BN 09-06-2021 2021-06-09 ... M NaN NaN NaN NaN NaN 2888321.0 63046.0 11137487 11135306

5 rows × 43 columns

In [23]:
# Creating a Sales Report with necessary columns
report = final_merged.rename(columns = {'derived_gender':'customer_gender'})[['email_address_key','TARGET_GENDER',
                                                                              'customer_gender','age','basket_id',
                                                                              'usd_after_disc_sale_amount']]
report.head()
Out[23]:
email_address_key TARGET_GENDER customer_gender age basket_id usd_after_disc_sale_amount
0 16983312 NaN M 55.0 81947282 5558.0
2 19696924 NaN M 31.0 23989389 5621.0
3 24120678 NaN M 37.0 12758238 2759.0
5 24092029 NaN M 34.0 18812838 3178.0
6 24079756 NaN M NaN 47744213 1668.0
In [24]:
# Grouping orders and sales for each customer 
col = {'basket_id':'no_of_orders', 'usd_after_disc_sale_amount':'sum_of_sales'}
report = report.groupby('email_address_key').agg({'customer_gender':'first','age':'first','basket_id':'nunique', 
                                                  'usd_after_disc_sale_amount':'sum'}).rename(columns = col)

# Sorting by email address key for better understanding
report.sort_values(by=['email_address_key'], ascending = True)
Out[24]:
customer_gender age no_of_orders sum_of_sales
email_address_key
692 F 46.0 1 862.50
959 F NaN 1 8172.69
1084 M 47.0 1 2242.50
1359 M 62.0 2 1041.00
9775 M NaN 1 3435.00
... ... ... ... ...
31490633 M NaN 1 3490.50
31490634 F NaN 1 50.54
31490635 M NaN 1 203.00
31490636 F NaN 1 1032.00
31490637 M NaN 1 160.00

402782 rows × 4 columns

This is the Sales Report for the orders, here we have data for all 4,02,782 customers like the count of orders a particular customer made so far and what's the spending amount of that customer along with that customer Gender and Age to get an overall idea.

In [25]:
# Checking unique count of orders
report.no_of_orders.unique()
Out[25]:
array([   1,    2,    3,    9,   14,    4,    6,    7,    5,    8,   10,
         13,   11,   12,   18,   16,   20,   17,   15,   21,   24,   27,
        121,   30,   25,   31,  403,   33,   22,   23,   29, 3602,   32,
         28,   38,   43,   19,  395,   26,   46,   73,   34,   65,   58,
        722,  100], dtype=int64)

We have the number of orders done by a customer ranging from 1-3602.

In [26]:
# Creating customer type column to get an Idea about customers
report['customer_type'] = np.where(report['no_of_orders'] > 1, 1, 0)
report.sort_values('sum_of_sales', ascending = False, inplace = True)
report.head()
Out[26]:
customer_gender age no_of_orders sum_of_sales customer_type
email_address_key
19666108 None NaN 3602 5989330.96 1
18082833 None NaN 403 2267026.85 1
24277775 M NaN 1 1000000.00 0
23810664 F NaN 722 910152.19 1
24668292 F NaN 2 818452.46 1
In [27]:
#checking unique values in customer type column
report.customer_type.unique()
Out[27]:
array([1, 0])

Customer Type Column Info

  • Customer Type = 1 then Frequent Customer
  • Customer Type = 0 then One time Customer
In [28]:
# Grouping customer type 
a = (report['customer_type'].value_counts(normalize=True) * 100).round(decimals=2)
b = report['customer_type'].value_counts()
a, b
Out[28]:
(0    82.45
 1    17.55
 Name: customer_type, dtype: float64,
 0    332085
 1     70697
 Name: customer_type, dtype: int64)

We have

  • 3,32,085 One time customers and
  • 70,697 Frequent Customers.

Jewelry Category Data¶

In [29]:
# Only keeping the Jewelry category data in non-returns final merged data
jewelry_df = final_merged.loc[(final_merged['merch_category_rollup'] == 'Diamond Jewelry') | 
                              (final_merged['merch_category_rollup'] == 'Other Jewelry')   |
                              (final_merged['merch_category_rollup'] == 'Bands')]
jewelry_df
Out[29]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... derived_gender age OFFER_KEY_x name TARGET_GENDER display_type PRODUCT_KEY OFFER_KEY_y BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
41 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 31371922 45832107 45832107 STANDARD BN 30-04-2022 2022-04-30 ... F NaN NaN NaN NaN NaN 3697562.0 50619.0 11855742 11855742
42 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 12192824 26092410 26092410 STANDARD BN 11-04-2022 2022-04-11 ... M 41.0 NaN NaN NaN NaN 2058073.0 51584.0 7867151 11817631
43 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 22167265 15123721 15123721 STANDARD BN 18-10-2021 2021-10-18 ... M NaN NaN NaN NaN NaN 3240108.0 50673.0 11384809 11384810
44 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 23403248 28743548 28743548 STANDARD BN 30-11-2020 2020-11-30 ... M NaN NaN NaN NaN NaN 3031577.0 50618.0 10642691 10642692
45 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 1590647 30915560 30915560 STANDARD BN 21-02-2021 2021-02-21 ... F NaN NaN NaN NaN NaN 2058089.0 51571.0 10901412 10901412
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
677949 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 3740508 48317500 48317500 STANDARD BN 28-08-2020 2020-08-28 ... F NaN NaN NaN NaN NaN 2058073.0 51584.0 10435269 10435269
677950 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 22859237 64797963 64797963 STANDARD BN 10-08-2020 2020-08-10 ... M 42.0 NaN NaN NaN NaN 2058088.0 51574.0 10399071 10399071
677951 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 23036258 22989040 22989040 STANDARD BN 15-09-2020 2020-09-15 ... M NaN NaN NaN NaN NaN 2058088.0 51574.0 10466130 10466130
677952 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 10708737 78485771 78485771 STANDARD BN 16-02-2021 2021-02-16 ... M 39.0 NaN NaN NaN NaN 3031615.0 50614.0 10893648 10893649
677954 Diamond Jewelry Diamond Jewelry Basic BYO BYO Stud Metal 18101292 51603645 51603645 STANDARD BN 22-08-2020 2020-08-22 ... M NaN NaN NaN NaN NaN 2058088.0 51574.0 10422395 10412311

545821 rows × 43 columns

Jewelry data has about 5,45,821 orders.

In [30]:
# Checking if the categories remained in the above data do belong to only those 3 categories
jewelry_df.merch_category_rollup.unique()
Out[30]:
array(['Diamond Jewelry', 'Bands', 'Other Jewelry'], dtype=object)
In [31]:
# Columns in the jewelry data
list(jewelry_df.columns)
Out[31]:
['merch_category_rollup',
 'merch_sub_category_rollup',
 'MERCH_PRODUCT_CATEGORY',
 'email_address_key',
 'basket_id',
 'original_basket_id',
 'order_type',
 'site',
 'basket_start_date',
 'submit_date',
 'order_date',
 'ship_date',
 'ship_delivery_date',
 'return_date',
 'discount_name',
 'local_currency',
 'promo_discount_applied_flag',
 'local_currency_before_disc_sale_amount',
 'discount_promo_amount',
 'local_currency_after_disc_sale_amount',
 'usd_after_disc_sale_amount',
 'usd_after_disc_cost_amount',
 'usd_exchange_credit_amount',
 'quantity',
 'DIAMOND_SKU_1',
 'DIAMOND_SKU_2',
 'DIAMOND_SKU_4',
 'DIAMOND_SKU_5',
 'OFFER_ID',
 'email_Address',
 'email_address',
 'first_name',
 'last_name',
 'derived_gender',
 'age',
 'OFFER_KEY_x',
 'name',
 'TARGET_GENDER',
 'display_type',
 'PRODUCT_KEY',
 'OFFER_KEY_y',
 'BILL_TO_ADDRESS_KEY',
 'SHIP_TO_ADDRESS_KEY']
In [32]:
# Dropping unnecessary and repetitive columns due to joins
jewelry_df=jewelry_df.drop(['merch_category_rollup','merch_sub_category_rollup','MERCH_PRODUCT_CATEGORY','original_basket_id',
                           'order_type','site','basket_start_date','ship_date','ship_delivery_date','return_date',
                           'discount_name','local_currency','promo_discount_applied_flag',
                           'local_currency_before_disc_sale_amount','discount_promo_amount',
                           'local_currency_after_disc_sale_amount','usd_after_disc_cost_amount','usd_exchange_credit_amount',
                           'quantity','DIAMOND_SKU_1','DIAMOND_SKU_2','DIAMOND_SKU_4','DIAMOND_SKU_5','OFFER_ID',
                           'email_Address','email_address','first_name','last_name','OFFER_KEY_x','name','display_type',
                           'PRODUCT_KEY','OFFER_KEY_y'],axis=1)
jewelry_df.head()
Out[32]:
email_address_key basket_id submit_date order_date usd_after_disc_sale_amount derived_gender age TARGET_GENDER BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
41 31371922 45832107 2022-04-30 2022-04-30 1481.0 F NaN NaN 11855742 11855742
42 12192824 26092410 2022-04-11 2022-04-11 2625.0 M 41.0 NaN 7867151 11817631
43 22167265 15123721 2021-10-18 2021-10-18 3778.0 M NaN NaN 11384809 11384810
44 23403248 28743548 2020-11-30 2020-11-30 1124.0 M NaN NaN 10642691 10642692
45 1590647 30915560 2021-02-21 2021-02-21 1543.0 F NaN NaN 10901412 10901412
In [33]:
# To get info of a particular customer
check = jewelry_df.loc[(jewelry_df['email_address_key'] == 11625)]
check.sort_values(by = ['basket_id']).head(10)
Out[33]:
email_address_key basket_id submit_date order_date usd_after_disc_sale_amount derived_gender age TARGET_GENDER BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
96549 11625 12362718 2021-06-23 2021-06-23 5.00 F 59.0 None 10791650 11160185
628354 11625 17315472 2021-11-24 2021-11-24 359.82 F 59.0 Female 11181371 11445703
97317 11625 17315472 2021-11-24 2021-11-24 5.00 F 59.0 None 11181371 11445703
618094 11625 26449813 2021-07-06 2021-07-06 290.44 F 59.0 Female 11181371 11160185
586705 11625 26449813 2021-07-06 2021-07-06 98.25 F 59.0 Female 11181371 11160185
525602 11625 26449813 2021-07-06 2021-07-06 15.75 F 59.0 Female 11181371 11160185
631203 11625 26449813 2021-07-06 2021-07-06 19.69 F 59.0 Female 11181371 11160185
96846 11625 26449813 2021-07-06 2021-07-06 5.00 F 59.0 None 11181371 11160185
94755 11625 29295317 2021-01-05 2021-01-06 5.00 F 59.0 None 10791650 10791651
661967 11625 29295317 2021-01-05 2021-01-06 29.73 F 59.0 Female 10791650 10791651

Gifter and Self-Gifter Segmentation of Jewelry Data¶

Rules\ Gifter

  • Target Gender != Customer Gender
  • Target Gender = Customer Gender and Shipping Address != Billing Address
  • Target Gender = 'Both'/'None'/NULL and Shipping Address != Billing Address
  • Customer Gender is NULL and Shipping Address != Billing Address

Self-Gifter

  • Target Gender = Customer Gender and Shipping Address = Billing Address
  • Target Gender = Customer Gender and Shipping Address/Billing Address is NULL
  • Target Gender = 'Both'/'None'/NULL and Shipping Address = Billing Address
  • Customer Gender is NULL and Shipping Address = Billing Address

Creating Analytical Dataset for Jewelry data¶

In [34]:
# Taking the final Jewelry data for segmentation
jewelry_df.head()
Out[34]:
email_address_key basket_id submit_date order_date usd_after_disc_sale_amount derived_gender age TARGET_GENDER BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
41 31371922 45832107 2022-04-30 2022-04-30 1481.0 F NaN NaN 11855742 11855742
42 12192824 26092410 2022-04-11 2022-04-11 2625.0 M 41.0 NaN 7867151 11817631
43 22167265 15123721 2021-10-18 2021-10-18 3778.0 M NaN NaN 11384809 11384810
44 23403248 28743548 2020-11-30 2020-11-30 1124.0 M NaN NaN 10642691 10642692
45 1590647 30915560 2021-02-21 2021-02-21 1543.0 F NaN NaN 10901412 10901412
In [35]:
# Keeping only columns that we need
analytical_df = jewelry_df.rename(columns={'email_address_key':'customer','basket_id':'order','derived_gender':'customer_gender'
                                           ,'TARGET_GENDER':'target_gender','SHIP_TO_ADDRESS_KEY':'shipping_add',
                                           'BILL_TO_ADDRESS_KEY':'billing_add'})[['customer','order','target_gender',
                                                                                  'customer_gender','shipping_add',
                                                                                  'billing_add']]
analytical_df.head()
Out[35]:
customer order target_gender customer_gender shipping_add billing_add
41 31371922 45832107 NaN F 11855742 11855742
42 12192824 26092410 NaN M 11817631 7867151
43 22167265 15123721 NaN M 11384810 11384809
44 23403248 28743548 NaN M 10642692 10642691
45 1590647 30915560 NaN F 10901412 10901412
In [36]:
# Checking unique values in target gender
analytical_df.target_gender.unique()
Out[36]:
array([nan, 'Female', 'Male', 'None', 'Both'], dtype=object)
In [37]:
# Analytical data column for target gender
analytical_df['target_gen'] = np.where(analytical_df['target_gender']=='Male',   1, 
                              np.where(analytical_df['target_gender']=='Female', 2,
                              np.where(analytical_df['target_gender']=='Both',   0, 
                              np.where(analytical_df['target_gender']=='None',   0, 0))))
analytical_df.head()
Out[37]:
customer order target_gender customer_gender shipping_add billing_add target_gen
41 31371922 45832107 NaN F 11855742 11855742 0
42 12192824 26092410 NaN M 11817631 7867151 0
43 22167265 15123721 NaN M 11384810 11384809 0
44 23403248 28743548 NaN M 10642692 10642691 0
45 1590647 30915560 NaN F 10901412 10901412 0
In [38]:
# Checking unique values in target gender analytical column
analytical_df.target_gen.unique()
Out[38]:
array([0, 2, 1])

Target Gender Analytical Column Info

  • Target Gender = 1 then Male
  • Target Gender = 2 then Female
  • Target Gender = 0 then Both/None/NULL
In [39]:
# Checking unique in customer gender
analytical_df.customer_gender.unique()
Out[39]:
array(['F', 'M', nan], dtype=object)
In [40]:
# Analytical data column for customer gender
analytical_df['customer_gen'] = np.where(analytical_df['customer_gender']=='M', 1, 
                                np.where(analytical_df['customer_gender']=='F', 2, 0))
analytical_df.head()
Out[40]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen
41 31371922 45832107 NaN F 11855742 11855742 0 2
42 12192824 26092410 NaN M 11817631 7867151 0 1
43 22167265 15123721 NaN M 11384810 11384809 0 1
44 23403248 28743548 NaN M 10642692 10642691 0 1
45 1590647 30915560 NaN F 10901412 10901412 0 2
In [41]:
# Checking unique values in customer gender analytical column
analytical_df.customer_gen.unique()
Out[41]:
array([2, 1, 0])

Customer Gender Analytical Column Info

  • Customer Gender = 1 then Male
  • Customer Gender = 2 then Female
  • Customer Gender = 0 then NULL
In [42]:
# Greating a separate dataset for null rows(filled with -1) of shipping address and billing address
nan_add = analytical_df[(analytical_df.shipping_add == -1) | (analytical_df.billing_add == -1)]
nan_add
Out[42]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen
176598 23628820 80857940 None F -1 -1 0 2
176670 12139903 57607230 None M -1 -1 0 1
176720 17688936 92484802 None F -1 -1 0 2
176859 15718378 74604902 None M -1 -1 0 1
176880 23537079 21144903 None F -1 -1 0 2
... ... ... ... ... ... ... ... ...
191243 20544734 43859729 None M -1 -1 0 1
191305 24585987 67772519 None M -1 -1 0 1
191321 24677732 84289742 None F -1 -1 0 2
652119 31445010 21145863 None M -1 -1 0 1
667185 23981434 90094119 None M -1 -1 0 1

223 rows × 8 columns

We have 223 orders that have null Shipping/Billing Addresses.

In [43]:
# Analytical data column for addresses
nan_add['address'] = np.where(nan_add['shipping_add'] == -1, 0,
                     np.where(nan_add['billing_add']  == -1, 0, 1))
nan_add.head()
C:\Users\affine\AppData\Local\Temp\ipykernel_16568\3818571357.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nan_add['address'] = np.where(nan_add['shipping_add'] == -1, 0,
Out[43]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen address
176598 23628820 80857940 None F -1 -1 0 2 0
176670 12139903 57607230 None M -1 -1 0 1 0
176720 17688936 92484802 None F -1 -1 0 2 0
176859 15718378 74604902 None M -1 -1 0 1 0
176880 23537079 21144903 None F -1 -1 0 2 0
In [44]:
# Checking unique values of address column in null dataset
nan_add.address.unique()
Out[44]:
array([0])
In [45]:
# Creating a separate dataset for not null rows(not equal to -1) of shipping and billing addresses
non_nan_add = analytical_df[(analytical_df.shipping_add != -1) | (analytical_df.billing_add != -1)]
non_nan_add
Out[45]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen
41 31371922 45832107 NaN F 11855742 11855742 0 2
42 12192824 26092410 NaN M 11817631 7867151 0 1
43 22167265 15123721 NaN M 11384810 11384809 0 1
44 23403248 28743548 NaN M 10642692 10642691 0 1
45 1590647 30915560 NaN F 10901412 10901412 0 2
... ... ... ... ... ... ... ... ...
677949 3740508 48317500 NaN F 10435269 10435269 0 2
677950 22859237 64797963 NaN M 10399071 10399071 0 1
677951 23036258 22989040 NaN M 10466130 10466130 0 1
677952 10708737 78485771 NaN M 10893649 10893648 0 1
677954 18101292 51603645 NaN M 10412311 10422395 0 1

545598 rows × 8 columns

In [46]:
# Analytical data column for addresses
non_nan_add['address'] = np.where(non_nan_add['shipping_add']==non_nan_add['billing_add'], 1, 
                         np.where(non_nan_add['shipping_add']!=non_nan_add['billing_add'], 2, 0))
non_nan_add.head()
C:\Users\affine\AppData\Local\Temp\ipykernel_16568\1429238194.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  non_nan_add['address'] = np.where(non_nan_add['shipping_add']==non_nan_add['billing_add'], 1,
Out[46]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen address
41 31371922 45832107 NaN F 11855742 11855742 0 2 1
42 12192824 26092410 NaN M 11817631 7867151 0 1 2
43 22167265 15123721 NaN M 11384810 11384809 0 1 2
44 23403248 28743548 NaN M 10642692 10642691 0 1 2
45 1590647 30915560 NaN F 10901412 10901412 0 2 1
In [47]:
# Checking unique values of address column in non null dataset
non_nan_add.address.unique()
Out[47]:
array([1, 2])
In [48]:
# Merging these two datas
analytical_df = pd.concat([nan_add, non_nan_add])
analytical_df.sort_values('customer', ascending = True, inplace = True)
analytical_df
Out[48]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen address
611486 692 58224365 Female F 11565604 5394034 2 2 2
16691 959 59011871 NaN F 11822256 3839390 0 2 2
175629 959 59011871 Female F 11822256 3839390 2 2 2
452063 1084 11849472 Female M 11580711 11580710 2 1 2
598424 1359 78326444 None M 11563268 11563268 0 1 1
... ... ... ... ... ... ... ... ... ...
128085 31490634 68606731 None F 12031454 12031454 0 2 1
560725 31490635 93951222 Female M 12031473 12031473 2 1 1
398927 31490636 56973875 Female F 12031477 12031477 2 2 1
214912 31490637 70914185 Female M 12031478 12031478 2 1 1
249992 31490637 70914185 None M 12031478 12031478 0 1 1

545821 rows × 9 columns

In [49]:
# Checking unique values of address column of analytical dataset
analytical_df.address.unique()
Out[49]:
array([2, 1, 0])

Address Analytical Column Info

  • Address = 1 then Shipping Address & Billing Address are Equal
  • Address = 2 then Shipping Address & Billing Address are Not-Equal
  • Address = 0 then Both Shipping Address & Billing Address are NULL or one of them are NULL

Quality check of Analytical Dataset¶

In [50]:
# Quick check to see if the analytical dataset has the right data  
qc = analytical_df.loc[(jewelry_df['email_address_key'] == 10106)]
qc
Out[50]:
customer order target_gender customer_gender shipping_add billing_add target_gen customer_gen address
262754 10106 78391343 None M 11550719 3306065 0 1 2
441382 10106 78391343 Female M 11550719 3306065 2 1 2
505052 10106 78391343 Female M 11550719 3306065 2 1 2
In [51]:
# Dropping original gender and address columns as we have analytical dataset
analytical_df = analytical_df.drop(['target_gender','customer_gender','shipping_add','billing_add'], axis = 1)
analytical_df.head()
Out[51]:
customer order target_gen customer_gen address
611486 692 58224365 2 2 2
16691 959 59011871 0 2 2
175629 959 59011871 2 2 2
452063 1084 11849472 2 1 2
598424 1359 78326444 0 1 1

Gifter, Self-Gifter Flag¶

In [52]:
# Creating gifter flag using the rules
analytical_df['gifter'] = np.where((analytical_df['target_gen']==1) & (analytical_df['customer_gen']==2), 1,                          
                          np.where((analytical_df['target_gen']==2) & (analytical_df['customer_gen']==1), 1 ,        
                          np.where((analytical_df['target_gen']==1) & (analytical_df['customer_gen']==1) 
                                                                    & (analytical_df['address']==2), 1,    
                          np.where((analytical_df['target_gen']==2) & (analytical_df['customer_gen']==2) 
                                                                    & (analytical_df['address']==2), 1,
                          np.where((analytical_df['target_gen']==0) & (analytical_df['address']==2), 1,           
                          np.where((analytical_df['customer_gen']==0) & (analytical_df['address']==2), 1, 0))))))
analytical_df.head()
Out[52]:
customer order target_gen customer_gen address gifter
611486 692 58224365 2 2 2 1
16691 959 59011871 0 2 2 1
175629 959 59011871 2 2 2 1
452063 1084 11849472 2 1 2 1
598424 1359 78326444 0 1 1 0
In [53]:
# Creating self-gifter flag
analytical_df['self_gifter'] = np.where((analytical_df['gifter'] == 1), 0, 
                               np.where((analytical_df['gifter'] == 0) & (analytical_df['address']==0), 0, 1))
analytical_df.head()
Out[53]:
customer order target_gen customer_gen address gifter self_gifter
611486 692 58224365 2 2 2 1 0
16691 959 59011871 0 2 2 1 0
175629 959 59011871 2 2 2 1 0
452063 1084 11849472 2 1 2 1 0
598424 1359 78326444 0 1 1 0 1

Now, we tagged almost every customer as either Gifter or Self-Gifter.

Final Flag Table¶

In [54]:
# Calculating sum of gifter and self_gifter flags per customer
cols = {'gifter':'sum_of_gifter', 'self_gifter':'sum_of_selfgifter'}
final_flag = analytical_df.groupby('customer').agg({'gifter':'sum','self_gifter':'sum'}).rename(columns=cols)
final_flag.sort_values(by=['customer'],ascending=True)
Out[54]:
sum_of_gifter sum_of_selfgifter
customer
692 1 0
959 2 0
1084 1 0
1359 1 1
9775 1 0
... ... ...
31490631 1 1
31490634 0 1
31490635 1 0
31490636 0 1
31490637 1 1

355655 rows × 2 columns

Taking the Sum of those gifter and self-gifter columns for each customer to decide on whether they one of them or kind of both gifter and self-gifter.

In [55]:
# Creating final flag
final_flag['final_flag'] = np.where((final_flag['sum_of_gifter']!=0) & (final_flag['sum_of_selfgifter']==0), 1,
                           np.where((final_flag['sum_of_gifter']==0) & (final_flag['sum_of_selfgifter']!=0), 2 ,        
                           np.where((final_flag['sum_of_gifter']!=0) & (final_flag['sum_of_selfgifter']!=0), 3, 0)))
final_flag.reset_index()
Out[55]:
customer sum_of_gifter sum_of_selfgifter final_flag
0 692 1 0 1
1 959 2 0 1
2 1084 1 0 1
3 1359 1 1 3
4 9775 1 0 1
... ... ... ... ...
355650 31490631 1 1 3
355651 31490634 0 1 2
355652 31490635 1 0 1
355653 31490636 0 1 2
355654 31490637 1 1 3

355655 rows × 4 columns

In [56]:
# Checking unique values of final flag
final_flag.final_flag.unique()
Out[56]:
array([1, 3, 2, 0])

Final Flag Column Info

  • Final flag = 1 then Gifter
  • Final flag = 2 then Self-Gifter
  • Final flag = 3 then Both Gifter and Self-Gifter
  • Final flag = 0 then Status Not Known

Final Output of Jewelry Gifting Segmentation¶

In [57]:
# Creating final output
final_out = final_flag.drop(['sum_of_gifter','sum_of_selfgifter'], axis = 1)
final_out = final_out.rename_axis('customer').reset_index()
final_out.head()
Out[57]:
customer final_flag
0 692 1
1 959 1
2 1084 1
3 1359 3
4 9775 1
In [58]:
# Changing the flags to names for clarity
final_out['final_flag'] = np.where(final_out['final_flag']==1, 'Gifter',
                          np.where(final_out['final_flag']==2, 'Self-Gifter',        
                          np.where(final_out['final_flag']==3, 'Both', 'Not Known')))
final_out
Out[58]:
customer final_flag
0 692 Gifter
1 959 Gifter
2 1084 Gifter
3 1359 Both
4 9775 Gifter
... ... ...
355650 31490631 Both
355651 31490634 Self-Gifter
355652 31490635 Gifter
355653 31490636 Self-Gifter
355654 31490637 Both

355655 rows × 2 columns

In [59]:
# Pie chart representation of the segmentation

gifts_segment = final_out.groupby('final_flag').agg({'final_flag':'first', 'customer':'nunique'})

a = gifts_segment.index
b = gifts_segment['customer']

plt.pie(b, labels = a, radius = 1.5,textprops = {"fontsize":15}, autopct = "%1.1f%%",
       wedgeprops = {"edgecolor":"black",'linewidth':0.5})

plt.show()
In [60]:
# Grouping final flag 
final_out['final_flag'].value_counts()
Out[60]:
Gifter         228764
Self-Gifter     95858
Both            30919
Not Known         114
Name: final_flag, dtype: int64

Gifting Flag Inference:

  • There are about 65% Gifters
  • 27% are Self-Gifters
  • 9% people are Both Gifters and Self-Gifters

RFM Segmentation¶

RFM for Engagement Category¶

Engagement Category Data¶

In [61]:
# Only keeping the Engagement category data in non-returns final merged data
engagement_df = final_merged.loc[(final_merged['merch_category_rollup'] == 'Engagement') | 
                                 (final_merged['merch_category_rollup'] == 'Loose Diamond')]
engagement_df
Out[61]:
merch_category_rollup merch_sub_category_rollup MERCH_PRODUCT_CATEGORY email_address_key basket_id original_basket_id order_type site basket_start_date submit_date ... derived_gender age OFFER_KEY_x name TARGET_GENDER display_type PRODUCT_KEY OFFER_KEY_y BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
2 Engagement Engagement BYO Semi-Mounts 19696924 23989389 23989389 SPECIAL ORDER BN 06-06-2021 2021-06-07 ... M 31.0 NaN NaN NaN NaN 3565789.0 76212.0 11129079 11129080
3 Engagement Engagement BYO Semi-Mounts 24120678 12758238 12758238 STANDARD BN 20-09-2021 2021-09-20 ... M 37.0 NaN NaN NaN NaN 3601472.0 76207.0 11329644 11178488
5 Engagement Engagement BYO Semi-Mounts 24092029 18812838 18812838 STANDARD BN 11-07-2021 2021-08-18 ... M 34.0 NaN NaN NaN NaN 2604603.0 60452.0 11264941 11264941
6 Engagement Engagement BYO Semi-Mounts 24079756 47744213 47744213 STANDARD BN 09-06-2021 2021-06-09 ... M NaN NaN NaN NaN NaN 2888321.0 63046.0 11137487 11135306
7 Engagement Engagement BYO Semi-Mounts 11025516 83308313 83308313 SPECIAL ORDER BN 24-06-2021 2021-06-25 ... M NaN NaN NaN NaN NaN 3577155.0 76455.0 11163959 11163960
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
677943 Engagement Engagement BYO 3-Stone Metal 24281009 65117329 65117329 STANDARD BN 29-09-2021 2021-10-08 ... M NaN NaN NaN NaN NaN 2737184.0 53140.0 11366100 11366101
677944 Engagement Engagement BYO Semi-Mounts 15321364 71624005 71624005 SPECIAL ORDER BN 23-06-2022 2022-06-20 ... F 1968.0 NaN NaN NaN NaN 3855345.0 76309.0 9762218 11955334
677945 Engagement Engagement BYO 3-Stone Metal 24074257 14960454 14960454 STANDARD BN 24-06-2021 2021-06-07 ... M 23.0 NaN NaN NaN NaN 2096005.0 49871.0 11158433 11158434
677948 Engagement Engagement BYO 3-Stone Metal 24520746 82913867 82913867 STANDARD BN 11-11-2021 2021-11-11 ... F NaN NaN NaN NaN NaN 3253923.0 53140.0 11442830 11442831
677953 Engagement Engagement BYO 3-Stone Metal 19190442 64237203 64237203 STANDARD BN 08-10-2020 2020-10-08 ... M 60.0 NaN NaN NaN NaN 2098637.0 49876.0 10504692 10504692

79322 rows × 43 columns

Engagement data has about 79,322 orders.

In [62]:
# Checking if the categories remained in the above data do belong to only those 2 categories
engagement_df.merch_category_rollup.unique()
Out[62]:
array(['Engagement', 'Loose Diamond'], dtype=object)
In [63]:
# Columns in the engagement data
list(engagement_df.columns)
Out[63]:
['merch_category_rollup',
 'merch_sub_category_rollup',
 'MERCH_PRODUCT_CATEGORY',
 'email_address_key',
 'basket_id',
 'original_basket_id',
 'order_type',
 'site',
 'basket_start_date',
 'submit_date',
 'order_date',
 'ship_date',
 'ship_delivery_date',
 'return_date',
 'discount_name',
 'local_currency',
 'promo_discount_applied_flag',
 'local_currency_before_disc_sale_amount',
 'discount_promo_amount',
 'local_currency_after_disc_sale_amount',
 'usd_after_disc_sale_amount',
 'usd_after_disc_cost_amount',
 'usd_exchange_credit_amount',
 'quantity',
 'DIAMOND_SKU_1',
 'DIAMOND_SKU_2',
 'DIAMOND_SKU_4',
 'DIAMOND_SKU_5',
 'OFFER_ID',
 'email_Address',
 'email_address',
 'first_name',
 'last_name',
 'derived_gender',
 'age',
 'OFFER_KEY_x',
 'name',
 'TARGET_GENDER',
 'display_type',
 'PRODUCT_KEY',
 'OFFER_KEY_y',
 'BILL_TO_ADDRESS_KEY',
 'SHIP_TO_ADDRESS_KEY']
In [64]:
# Dropping unnecessary and repetitive columns due to joins
engagement_df=engagement_df.drop(['merch_category_rollup','merch_sub_category_rollup','MERCH_PRODUCT_CATEGORY',
                                  'original_basket_id','order_type','site','basket_start_date','ship_date','ship_delivery_date',
                                  'return_date','discount_name','local_currency','promo_discount_applied_flag',
                                  'local_currency_before_disc_sale_amount','discount_promo_amount',
                                  'local_currency_after_disc_sale_amount','usd_after_disc_cost_amount',
                                  'usd_exchange_credit_amount','quantity','DIAMOND_SKU_1','DIAMOND_SKU_2','DIAMOND_SKU_4',
                                  'DIAMOND_SKU_5','OFFER_ID','email_Address','email_address','first_name','last_name',
                                  'OFFER_KEY_x','name','display_type','PRODUCT_KEY','OFFER_KEY_y'],axis = 1)
engagement_df
Out[64]:
email_address_key basket_id submit_date order_date usd_after_disc_sale_amount derived_gender age TARGET_GENDER BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
2 19696924 23989389 2021-06-07 2021-06-07 5621.00 M 31.0 NaN 11129079 11129080
3 24120678 12758238 2021-09-20 2021-09-20 2759.00 M 37.0 NaN 11329644 11178488
5 24092029 18812838 2021-08-18 2021-08-18 3178.00 M 34.0 NaN 11264941 11264941
6 24079756 47744213 2021-06-09 2021-06-13 1668.00 M NaN NaN 11137487 11135306
7 11025516 83308313 2021-06-25 2021-06-25 3749.00 M NaN NaN 11163959 11163960
... ... ... ... ... ... ... ... ... ... ...
677943 24281009 65117329 2021-10-08 2021-10-08 25481.50 M NaN NaN 11366100 11366101
677944 15321364 71624005 2022-06-20 2022-06-28 13973.43 F 1968.0 NaN 9762218 11955334
677945 24074257 14960454 2021-06-07 2021-06-27 3758.00 M 23.0 NaN 11158433 11158434
677948 24520746 82913867 2021-11-11 2021-11-16 17564.52 F NaN NaN 11442830 11442831
677953 19190442 64237203 2020-10-08 2020-10-09 6235.00 M 60.0 NaN 10504692 10504692

79322 rows × 10 columns

In [65]:
# Changing email address key column as customer
engagement_df = engagement_df.rename(columns = {'email_address_key' : 'Customer'})

# Taking only columns that we need
engagement_df = engagement_df[['Customer', 'submit_date', 'basket_id', 'usd_after_disc_sale_amount']]

# Sorting them in the Descending order of Submit Date
engagement_df.sort_values(by = ['submit_date'], ascending = False)
Out[65]:
Customer submit_date basket_id usd_after_disc_sale_amount
443889 31490238 2022-08-03 34373726 4232.50
583219 31480521 2022-08-03 88201332 24945.62
671734 31486262 2022-08-03 84878846 862.74
528026 31490243 2022-08-03 72212082 1238.50
615416 31490189 2022-08-03 69803977 6997.35
... ... ... ... ...
3209 22838239 2020-08-01 79139376 10339.00
334138 22838107 2020-08-01 72716443 3042.00
28613 22834473 2020-08-01 88117283 816.00
11438 22838240 2020-08-01 47732552 3997.00
626897 22809875 2020-08-01 28246872 2844.99

79322 rows × 4 columns

In [66]:
# Counting unique values in each column
engagement_df.nunique()
Out[66]:
Customer                      70432
submit_date                     733
basket_id                     75453
usd_after_disc_sale_amount    50418
dtype: int64

We have about 70,432 customers in Engagement Category.

In [67]:
# Grouping at customer level
columns = {'submit_date':'Recency_Date','basket_id':'Frequency', 'usd_after_disc_sale_amount':'Sales_Amount'}
engagement_rfm = engagement_df.groupby('Customer').agg({'submit_date':'first','basket_id':'nunique', 
                                                        'usd_after_disc_sale_amount':'sum'}).rename(columns=columns).round(decimals=2)
engagement_rfm.reset_index()
Out[67]:
Customer Recency_Date Frequency Sales_Amount
0 959 2022-04-12 1 2685.19
1 10261 2020-10-17 3 9604.79
2 10728 2022-05-16 1 4408.00
3 11625 2021-05-31 1 83.18
4 12177 2021-02-25 1 10010.69
... ... ... ... ...
70427 31490557 2022-08-03 1 1181.50
70428 31490598 2022-08-03 1 3111.84
70429 31490625 2022-08-03 1 7437.50
70430 31490630 2022-08-03 1 2181.57
70431 31490633 2022-08-03 1 3490.50

70432 rows × 4 columns

In [68]:
# Getting most recent ordered date
reference_date = engagement_rfm['Recency_Date'].max() + pd.DateOffset(days=1)
reference_date
Out[68]:
Timestamp('2022-08-04 00:00:00')

Reference Date is the next day to the most recent date in the Recency Date column which we take it as a reference to calculate the recency days.

In [69]:
# Creating Recency column in days format
engagement_rfm['Recency'] = engagement_rfm['Recency_Date'].apply(lambda row: (reference_date - row).days)
engagement_rfm.drop('Recency_Date', inplace = True, axis = 1)

engagement_rfm = engagement_rfm[['Recency', 'Frequency', 'Sales_Amount']]
engagement_rfm.head()
Out[69]:
Recency Frequency Sales_Amount
Customer
959 114 1 2685.19
10261 656 3 9604.79
10728 80 1 4408.00
11625 430 1 83.18
12177 525 1 10010.69
In [70]:
# Creating Monetary column
engagement_rfm['Monetary'] = (engagement_rfm['Sales_Amount']/engagement_rfm['Frequency']).round(decimals=2)

# Dropping Sales Amount column
engagement_rfm = engagement_rfm.drop('Sales_Amount', axis = 1)
engagement_rfm.reset_index(inplace=True)
engagement_rfm.head()
Out[70]:
Customer Recency Frequency Monetary
0 959 114 1 2685.19
1 10261 656 3 3201.60
2 10728 80 1 4408.00
3 11625 430 1 83.18
4 12177 525 1 10010.69

We now created the Rececy, Frequency and Monetary value of all 70,432 customers.

In [71]:
# Checking the minimum value of Monetary Column 
engagement_rfm['Monetary'].min()
Out[71]:
0.0
In [72]:
# Removing those values with 0 monetary value after checking the purchase info 
engagement_rfm = engagement_rfm[engagement_rfm.Monetary != 0] 

# Checking if 0 Monetary values are removed or not
engagement_rfm['Monetary'].min()
Out[72]:
49.0
In [73]:
# Checking Recency Range
a, b = engagement_rfm.Recency.min(), engagement_rfm.Recency.max()
a, b
Out[73]:
(1, 733)
In [74]:
# Checking Frequency Range
c, d = engagement_rfm.Frequency.min(), engagement_rfm.Frequency.max()
c, d
Out[74]:
(1, 1835)
In [75]:
# Checking Monetary Range
e, f = engagement_rfm.Monetary.min(), engagement_rfm.Monetary.max()
e, f
Out[75]:
(49.0, 810965.47)
In [76]:
# Creating bins to check distribution
rec_range = [0, 90, 180, 270, 360, 450, 540, 630, 740]
rec_labels = ['<3','3-6','6-9','9-12','12-15','15-18','18-21','21+']

freq_range  = [0,  1,   2,   5,   20,  50,  1835]
freq_labels = ['1','2','3-5','6-20', '21-50','>50']

money_range  = [0, 1000,  3000,  5000,  10000, 20000, 50000, 70000, 100000, 900000]
money_labels = ['<1000','1-3k','3-5k', '5-10k','10-20k','20-50k','50-70k','70k-1lc', '>1lc']


engagement_rfm['R_Labels'] = pd.cut(engagement_rfm['Recency'], bins = rec_range, labels = rec_labels)
engagement_rfm['F_Labels'] = pd.cut(engagement_rfm['Frequency'], bins = freq_range, labels = freq_labels)
engagement_rfm['M_Labels'] = pd.cut(engagement_rfm['Monetary'], bins = money_range, labels = money_labels)
engagement_rfm.head()
Out[76]:
Customer Recency Frequency Monetary R_Labels F_Labels M_Labels
0 959 114 1 2685.19 3-6 1 1-3k
1 10261 656 3 3201.60 21+ 3-5 3-5k
2 10728 80 1 4408.00 <3 1 3-5k
3 11625 430 1 83.18 12-15 1 <1000
4 12177 525 1 10010.69 15-18 1 10-20k
In [77]:
# Checking the Distribution of the Bins

fig = plt.figure(figsize=(14,14))

# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = engagement_rfm['R_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Bins Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = engagement_rfm['F_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Bins Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = engagement_rfm['M_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Bins Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

fig.tight_layout()
plt.show()
In [78]:
# Getting the number of customers over different Bins Distribution

a = ((engagement_rfm['R_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((engagement_rfm['F_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((engagement_rfm['M_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
a, b, c
Out[78]:
({'21+': 16.5,
  '18-21': 13.63,
  '15-18': 12.84,
  '6-9': 12.83,
  '9-12': 11.61,
  '12-15': 11.56,
  '<3': 10.71,
  '3-6': 10.33},
 {'1': 97.94,
  '2': 1.58,
  '3-5': 0.36,
  '6-20': 0.11,
  '21-50': 0.01,
  '>50': 0.01},
 {'1-3k': 26.76,
  '5-10k': 25.38,
  '3-5k': 20.9,
  '10-20k': 13.6,
  '<1000': 7.18,
  '20-50k': 5.29,
  '50-70k': 0.46,
  '70k-1lc': 0.22,
  '>1lc': 0.21})

Recency

  • As we have uniform distribution across different Recency levels, We can divide the Recency into 4 bins each of 6 months timespan as these are Luxury goods where people don't buy that frequently.
  • For Recency, Recent customers are more valuable so ranking is higher for the recent customers.

Frequency

  • We have 98% one time customers and so we can create 3 bins for frequency.

Monetary

  • Creating 5 bins of Monetary in which 3 bins are for the transactions below 20,000 dollars as 93% of orders are done below 20,000.
In [79]:
# Dropping Label columns of R, F and M as we no longer need them
engagement_rfm = engagement_rfm.drop(['R_Labels','F_Labels','M_Labels'],axis = 1)
engagement_rfm.head()
Out[79]:
Customer Recency Frequency Monetary
0 959 114 1 2685.19
1 10261 656 3 3201.60
2 10728 80 1 4408.00
3 11625 430 1 83.18
4 12177 525 1 10010.69
In [80]:
# Function to create R value
def Rvalue(x):
    if x <= 180:
        return 4
    elif x <= 360:
        return 3
    elif x <= 540: 
        return 2
    else:
        return 1

# Function to create F value    
def Fvalue(x):
    if x <= 1:
        return 1
    else:
        return 2   

# Function to create M value    
def Mvalue(x):
    if x <= 2500:
        return 1
    elif x <= 5000:
        return 2
    elif x <= 10000:
        return 3
    elif x <= 25000: 
        return 4
    else:
        return 5
In [81]:
# Adding R, F and M segment value columns to the existing dataset
engagement_rfm['R'] = engagement_rfm['Recency'].apply(Rvalue)
engagement_rfm['F'] = engagement_rfm['Frequency'].apply(Fvalue)
engagement_rfm['M'] = engagement_rfm['Monetary'].apply(Mvalue)
engagement_rfm.head()
Out[81]:
Customer Recency Frequency Monetary R F M
0 959 114 1 2685.19 4 1 2
1 10261 656 3 3201.60 1 2 2
2 10728 80 1 4408.00 4 1 2
3 11625 430 1 83.18 2 1 1
4 12177 525 1 10010.69 2 1 4
In [82]:
# Checking the Distribution of the bins

fig = plt.figure(figsize=(14,14))

# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = engagement_rfm['R'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = engagement_rfm['F'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = engagement_rfm['M'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

fig.tight_layout()
plt.show()
In [83]:
# Getting the number of customers over different final Bins Distribution

a = ((engagement_rfm['R'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((engagement_rfm['F'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((engagement_rfm['M'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
sorted(a.items()), sorted(b.items()), sorted(c.items())
Out[83]:
([(1, 30.12), (2, 24.4), (3, 24.44), (4, 21.04)],
 [(1, 97.94), (2, 2.06)],
 [(1, 27.57), (2, 27.27), (3, 25.38), (4, 15.9), (5, 3.88)])
In [84]:
# Concatinating the 3 values as final RFM column
engagement_rfm['RFM'] = engagement_rfm.R.map(str) + engagement_rfm.F.map(str) + engagement_rfm.M.map(str)
engagement_rfm
Out[84]:
Customer Recency Frequency Monetary R F M RFM
0 959 114 1 2685.19 4 1 2 412
1 10261 656 3 3201.60 1 2 2 122
2 10728 80 1 4408.00 4 1 2 412
3 11625 430 1 83.18 2 1 1 211
4 12177 525 1 10010.69 2 1 4 214
... ... ... ... ... ... ... ... ...
70427 31490557 1 1 1181.50 4 1 1 411
70428 31490598 1 1 3111.84 4 1 2 412
70429 31490625 1 1 7437.50 4 1 3 413
70430 31490630 1 1 2181.57 4 1 1 411
70431 31490633 1 1 3490.50 4 1 2 412

70430 rows × 8 columns

In [85]:
# Unique combinations of RFM column
eng = engagement_rfm.RFM.unique()
print(sorted(eng))
['111', '112', '113', '114', '115', '121', '122', '123', '124', '125', '211', '212', '213', '214', '215', '221', '222', '223', '224', '225', '311', '312', '313', '314', '315', '321', '322', '323', '324', '325', '411', '412', '413', '414', '415', '421', '422', '423', '424', '425']
In [86]:
# Total Distinct Customer categories we got through RFM
engagement_rfm.RFM.nunique()
Out[86]:
40
In [88]:
engagement_rfm['Segment'] = np.where(engagement_rfm['RFM'].isin(['425','415','325','315','225','215','125','115']), 
                                     'VIP', 
                            np.where(engagement_rfm['RFM'].isin(['424','423','324','323']), 'Best',
                            np.where(engagement_rfm['RFM'].isin(['414','413','314','313','224','223','214','213','124','123',
                                                                 '114','113']), 'Potential',
                            np.where(engagement_rfm['RFM'].isin(['422','421','322','321']), 'Promising', 
                            np.where(engagement_rfm['RFM'].isin(['412','411','312','311']), 'Rookies', 
                            np.where(engagement_rfm['RFM'].isin(['222','221','212','211']), 'Winback',
                            np.where(engagement_rfm['RFM'].isin(['122','121','112','111']),
                                     'Churned','Others')))))))        
engagement_rfm
Out[88]:
Customer Recency Frequency Monetary R F M RFM Segment
0 959 114 1 2685.19 4 1 2 412 Rookies
1 10261 656 3 3201.60 1 2 2 122 Churned
2 10728 80 1 4408.00 4 1 2 412 Rookies
3 11625 430 1 83.18 2 1 1 211 Winback
4 12177 525 1 10010.69 2 1 4 214 Potential
... ... ... ... ... ... ... ... ... ...
70427 31490557 1 1 1181.50 4 1 1 411 Rookies
70428 31490598 1 1 3111.84 4 1 2 412 Rookies
70429 31490625 1 1 7437.50 4 1 3 413 Potential
70430 31490630 1 1 2181.57 4 1 1 411 Rookies
70431 31490633 1 1 3490.50 4 1 2 412 Rookies

70430 rows × 9 columns

In [89]:
# Plotting the Segmentation
eng_seg = engagement_rfm.groupby(['Segment'])['Customer'].count()
ax = eng_seg.plot(kind='bar', figsize=(6,6), width = 0.5, edgecolor=None)

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
In [90]:
# Percentage of customers among different segments
(engagement_rfm['Segment'].value_counts(normalize=True) * 100).round(decimals=2)
Out[90]:
Potential    41.03
Rookies      23.26
Churned      17.67
Winback      13.24
VIP           3.88
Promising     0.66
Best          0.25
Name: Segment, dtype: float64

We divided the customers into 6 kind of segments and we have Potential Customers more among all the segments having 33% of distribution.

RFM Segmentation for Jewelry Category¶

In [91]:
# Getting Jewelry data Info
jewelry_df
Out[91]:
email_address_key basket_id submit_date order_date usd_after_disc_sale_amount derived_gender age TARGET_GENDER BILL_TO_ADDRESS_KEY SHIP_TO_ADDRESS_KEY
41 31371922 45832107 2022-04-30 2022-04-30 1481.0 F NaN NaN 11855742 11855742
42 12192824 26092410 2022-04-11 2022-04-11 2625.0 M 41.0 NaN 7867151 11817631
43 22167265 15123721 2021-10-18 2021-10-18 3778.0 M NaN NaN 11384809 11384810
44 23403248 28743548 2020-11-30 2020-11-30 1124.0 M NaN NaN 10642691 10642692
45 1590647 30915560 2021-02-21 2021-02-21 1543.0 F NaN NaN 10901412 10901412
... ... ... ... ... ... ... ... ... ... ...
677949 3740508 48317500 2020-08-28 2020-08-30 1437.0 F NaN NaN 10435269 10435269
677950 22859237 64797963 2020-08-10 2020-08-10 1712.0 M 42.0 NaN 10399071 10399071
677951 23036258 22989040 2020-09-15 2020-09-16 2380.0 M NaN NaN 10466130 10466130
677952 10708737 78485771 2021-02-16 2021-02-16 2357.0 M 39.0 NaN 10893648 10893649
677954 18101292 51603645 2020-08-22 2020-08-23 3545.0 M NaN NaN 10422395 10412311

545821 rows × 10 columns

In [92]:
# Changing email address key column as customer
jewelry_df = jewelry_df.rename(columns = {'email_address_key' : 'Customer'})

# Taking only columns that we need
jewelry_df = jewelry_df[['Customer', 'submit_date', 'basket_id', 'usd_after_disc_sale_amount']]

# Sorting them in the Descending order of Submit Date
jewelry_df.sort_values(by = ['submit_date'], ascending = False)
Out[92]:
Customer submit_date basket_id usd_after_disc_sale_amount
627090 31441046 2022-08-03 21152111 7982.10
441805 24304301 2022-08-03 35990748 1500.00
276327 31490604 2022-08-03 11192980 30.00
516607 31490214 2022-08-03 67543863 841.50
400545 31490605 2022-08-03 83973282 1352.00
... ... ... ... ...
230121 22838650 2020-08-01 70850136 265.00
362450 22822119 2020-08-01 89567762 3946.05
506077 20931609 2020-08-01 21512752 266.00
531063 22838480 2020-08-01 18121710 553.00
437620 2835069 2020-08-01 44203668 1962.00

545821 rows × 4 columns

In [93]:
# Counting unique values in each column
jewelry_df.nunique()
Out[93]:
Customer                      355655
submit_date                      733
basket_id                     446618
usd_after_disc_sale_amount     54879
dtype: int64
In [94]:
# Grouping at customer level
columns = {'submit_date':'Recency_Date','basket_id':'Frequency', 'usd_after_disc_sale_amount':'Sales_Amount'}
jewelry_rfm = jewelry_df.groupby('Customer').agg({'submit_date':'first','basket_id':'nunique', 
                                                  'usd_after_disc_sale_amount':'sum'}).rename(columns=columns).round(decimals=2)
jewelry_rfm.reset_index()
Out[94]:
Customer Recency_Date Frequency Sales_Amount
0 692 2021-12-16 1 862.50
1 959 2022-04-12 1 5487.50
2 1084 2021-12-19 1 2242.50
3 1359 2021-12-15 2 1041.00
4 9775 2022-05-01 1 3435.00
... ... ... ... ...
355650 31490631 2022-08-03 1 355.00
355651 31490634 2022-08-03 1 50.29
355652 31490635 2022-08-03 1 203.00
355653 31490636 2022-08-03 1 1032.00
355654 31490637 2022-08-03 1 160.00

355655 rows × 4 columns

In [95]:
# Getting most recent ordered date
reference_date = jewelry_rfm['Recency_Date'].max() + pd.DateOffset(days=1)
reference_date
Out[95]:
Timestamp('2022-08-04 00:00:00')
In [96]:
# Creating Recency column in days format
jewelry_rfm['Recency'] = jewelry_rfm['Recency_Date'].apply(lambda row: (reference_date - row).days)
jewelry_rfm.drop('Recency_Date', inplace = True, axis = 1)
jewelry_rfm = jewelry_rfm[['Recency', 'Frequency', 'Sales_Amount']]
jewelry_rfm.reset_index()
Out[96]:
Customer Recency Frequency Sales_Amount
0 692 231 1 862.50
1 959 114 1 5487.50
2 1084 228 1 2242.50
3 1359 232 2 1041.00
4 9775 95 1 3435.00
... ... ... ... ...
355650 31490631 1 1 355.00
355651 31490634 1 1 50.29
355652 31490635 1 1 203.00
355653 31490636 1 1 1032.00
355654 31490637 1 1 160.00

355655 rows × 4 columns

In [97]:
# Creating Monetary column
jewelry_rfm['Monetary'] = (jewelry_rfm['Sales_Amount']/jewelry_rfm['Frequency']).round(decimals=2)

# Dropping Sales Amount column
jewelry_rfm = jewelry_rfm.drop('Sales_Amount', axis = 1)
jewelry_rfm.reset_index(inplace=True)
jewelry_rfm.head()
Out[97]:
Customer Recency Frequency Monetary
0 692 231 1 862.5
1 959 114 1 5487.5
2 1084 228 1 2242.5
3 1359 232 2 520.5
4 9775 95 1 3435.0
In [98]:
# Checking Recency Range
a, b = jewelry_rfm.Recency.min(), jewelry_rfm.Recency.max()
a, b
Out[98]:
(1, 733)
In [99]:
# Checking Frequency Range
c, d = jewelry_rfm.Frequency.min(), jewelry_rfm.Frequency.max()
c, d
Out[99]:
(1, 1787)
In [100]:
# Checking maximum Monetary
e = jewelry_rfm.Monetary.max()
e
Out[100]:
340000.01
In [101]:
# Creating bins to check distribution
rec_range = [0, 90, 180, 270, 360, 450, 540, 630, 740]
rec_labels = ['<3','3-6','6-9','9-12','12-15','15-18','18-21','21+']

freq_range  = [0,  1,   2,   5,   15,  50,  1800]
freq_labels = ['1','2','3-5','6-15', '16-50','>50']

money_range  = [0, 100, 300, 500, 1000,  3000,  5000,  10000, 20000, 50000, 70000, 100000, 400000]
money_labels = ['<100','100-300','300-500','500-1k','1-3k','3-5k', '5-10k','10-20k','20-50k','50-70k','70k-1lc', '>1lc']


jewelry_rfm['R_Labels'] = pd.cut(jewelry_rfm['Recency'], bins = rec_range, labels = rec_labels)
jewelry_rfm['F_Labels'] = pd.cut(jewelry_rfm['Frequency'], bins = freq_range, labels = freq_labels)
jewelry_rfm['M_Labels'] = pd.cut(jewelry_rfm['Monetary'], bins = money_range, labels = money_labels)
jewelry_rfm.head()
Out[101]:
Customer Recency Frequency Monetary R_Labels F_Labels M_Labels
0 692 231 1 862.5 6-9 1 500-1k
1 959 114 1 5487.5 3-6 1 5-10k
2 1084 228 1 2242.5 6-9 1 1-3k
3 1359 232 2 520.5 6-9 2 500-1k
4 9775 95 1 3435.0 3-6 1 3-5k
In [102]:
# Checking the Distribution of the segmentations

fig = plt.figure(figsize=(14,14))

# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = jewelry_rfm['R_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Bins Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = jewelry_rfm['F_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Bins Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = jewelry_rfm['M_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Bins Distribution', y = 1)
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

fig.tight_layout()
plt.show()
In [103]:
# Getting the number of customers over different Bins Distribution
a = ((jewelry_rfm['R_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((jewelry_rfm['F_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((jewelry_rfm['M_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
a, b, c
Out[103]:
({'18-21': 19.1,
  '6-9': 18.63,
  '21+': 11.78,
  '3-6': 11.11,
  '15-18': 10.94,
  '9-12': 10.4,
  '12-15': 9.05,
  '<3': 8.98},
 {'1': 83.25,
  '2': 12.38,
  '3-5': 3.98,
  '6-15': 0.36,
  '16-50': 0.02,
  '>50': 0.0},
 {'500-1k': 22.48,
  '100-300': 22.29,
  '1-3k': 21.08,
  '300-500': 16.02,
  '<100': 12.23,
  '3-5k': 3.2,
  '5-10k': 1.87,
  '10-20k': 0.61,
  '20-50k': 0.19,
  '50-70k': 0.01,
  '70k-1lc': 0.0,
  '>1lc': 0.0})
In [104]:
# Dropping Label columns of R, F and M as we no longer need them
jewelry_rfm = jewelry_rfm.drop(['R_Labels','F_Labels','M_Labels'],axis = 1)
jewelry_rfm.head()
Out[104]:
Customer Recency Frequency Monetary
0 692 231 1 862.5
1 959 114 1 5487.5
2 1084 228 1 2242.5
3 1359 232 2 520.5
4 9775 95 1 3435.0
In [105]:
# Function to create R value
def Rval(x):
    if x <= 180:
        return 4
    elif x <= 360:
        return 3
    elif x <= 540: 
        return 2
    else:
        return 1

# Function to create F value    
def Fval(x):
    if x <= 1:
        return 1
    elif x <= 4:
        return 2
    else:
        return 3   

# Function to create M value    
def Mval(x):
    if x <= 300:
        return 1
    elif x <= 750:
        return 2
    elif x <= 3500:
        return 3
    elif x <= 20000: 
        return 4
    else:
        return 5
In [106]:
# Adding R, F and M segment value columns to the existing dataset
jewelry_rfm['R'] = jewelry_rfm['Recency'].apply(Rval)
jewelry_rfm['F'] = jewelry_rfm['Frequency'].apply(Fval)
jewelry_rfm['M'] = jewelry_rfm['Monetary'].apply(Mval)
jewelry_rfm.head()
Out[106]:
Customer Recency Frequency Monetary R F M
0 692 231 1 862.5 3 1 3
1 959 114 1 5487.5 4 1 4
2 1084 228 1 2242.5 3 1 3
3 1359 232 2 520.5 3 2 2
4 9775 95 1 3435.0 4 1 3
In [107]:
# Checking the Distribution of the segmentations

fig = plt.figure(figsize=(12,12))

# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = jewelry_rfm['R'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Distribution', y = 1)
for p in ax1.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = jewelry_rfm['F'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Distribution', y = 1)
for p in ax1.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = jewelry_rfm['M'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Distribution', y = 1)
for p in ax1.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')

fig.tight_layout()
plt.show()
In [108]:
# Getting the number of customers over different final Bins Distribution
a = ((jewelry_rfm['R'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((jewelry_rfm['F'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((jewelry_rfm['M'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
sorted(a.items()), sorted(b.items()), sorted(c.items())
Out[108]:
([(1, 30.88), (2, 19.99), (3, 29.03), (4, 20.09)],
 [(1, 83.25), (2, 16.02), (3, 0.73)],
 [(1, 34.52), (2, 29.61), (3, 31.14), (4, 4.52), (5, 0.21)])
In [109]:
# Concatinating the 3 values as final RFM column
jewelry_rfm['RFM'] = jewelry_rfm.R.map(str) + jewelry_rfm.F.map(str) + jewelry_rfm.M.map(str)
jewelry_rfm
Out[109]:
Customer Recency Frequency Monetary R F M RFM
0 692 231 1 862.50 3 1 3 313
1 959 114 1 5487.50 4 1 4 414
2 1084 228 1 2242.50 3 1 3 313
3 1359 232 2 520.50 3 2 2 322
4 9775 95 1 3435.00 4 1 3 413
... ... ... ... ... ... ... ... ...
355650 31490631 1 1 355.00 4 1 2 412
355651 31490634 1 1 50.29 4 1 1 411
355652 31490635 1 1 203.00 4 1 1 411
355653 31490636 1 1 1032.00 4 1 3 413
355654 31490637 1 1 160.00 4 1 1 411

355655 rows × 8 columns

In [110]:
# Unique combinations of RFM column
jwl = jewelry_rfm.RFM.unique()
print(sorted(jwl))
['111', '112', '113', '114', '115', '121', '122', '123', '124', '125', '131', '132', '133', '134', '211', '212', '213', '214', '215', '221', '222', '223', '224', '225', '231', '232', '233', '234', '235', '311', '312', '313', '314', '315', '321', '322', '323', '324', '325', '331', '332', '333', '334', '335', '411', '412', '413', '414', '415', '421', '422', '423', '424', '425', '431', '432', '433', '434']
In [111]:
# Total Distinct Customer categories we got through RFM
jewelry_rfm.RFM.nunique()
Out[111]:
58

We have 58 unique categories of kinds of customers now we need to bucket them into one of the below segments to target them.

In [121]:
# Tagging Customers
jewelry_rfm['Segment'] = np.where(jewelry_rfm['RFM'].isin(['425','415','335','325','315','235','225','215','125','115']), 
                                  'VIP', 
                         np.where(jewelry_rfm['RFM'].isin(['434','433','424','423','334','333','324','323']), 'Best',
                         np.where(jewelry_rfm['RFM'].isin(['414','413','314','313','234','233','224','223','214','213','134','133','124','114',
                                                           '123','114','113']), 'Potential', 
                         np.where(jewelry_rfm['RFM'].isin(['432','431','422','421','332','331','322','321']), 'Promising',
                         np.where(jewelry_rfm['RFM'].isin(['232','231','222','221','212','211']), 'Winback',
                         np.where(jewelry_rfm['RFM'].isin(['132','131','122','121','112','111']), 'Churned',
                         np.where(jewelry_rfm['RFM'].isin(['412','411','312','311']), 'Rookies','Others')))))))
    
jewelry_rfm.head()
Out[121]:
Customer Recency Frequency Monetary R F M RFM Segment
0 692 231 1 862.5 3 1 3 313 Potential
1 959 114 1 5487.5 4 1 4 414 Potential
2 1084 228 1 2242.5 3 1 3 313 Potential
3 1359 232 2 520.5 3 2 2 322 Promising
4 9775 95 1 3435.0 4 1 3 413 Potential
In [122]:
# Plotting the Segmentation
jwl_seg = jewelry_rfm.groupby(['Segment'])['Customer'].count()
ax = jwl_seg.plot(kind='bar', figsize = (6,6), width = 0.5, edgecolor = None)

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
In [123]:
# Percentage of customers among different segments
(jewelry_rfm['Segment'].value_counts(normalize = True) * 100).round(decimals=2)
Out[123]:
Potential    32.43
Rookies      26.05
Churned      20.88
Winback      12.39
Promising     4.81
Best          3.23
VIP           0.21
Name: Segment, dtype: float64

We divided the customers into 6 kind of segments and we have Potential Customers more among all the segments having 32% of distribution.